Problem Statement

  • Buying a house requires a lot of careful planning. Once we have finalized your budget and the house that we want to buy, we must ensure sufficient funds to pay the seller.

  • With rising property rates, most people avail home loans to buy their dream houses. The bank only lends up to 80% of the total amount based on a person's finances (salary, outgoing expenses, existing loans, etc.). We will need to make the rest of the payment yourself after the bank tells you how much they can lend.

  • We need to predicting loan amount that can be sanctioned to applicant based on available features.

Data Description

Columns Column Description
Customer ID Represents a unique identification number of a customer
Name Represents the name of a customer
Gender Represents the gender of a customer
Age Represents the age of a customer
Income (USD) Represents the income of a customer
Income Stability Represents whether a customer has a stable source of income
Profession Represents the profession of a customer
Type of Employment Represents the type of employment of a customer
Location Represents the current location that a customer resides
Loan Amount Request (USD) Represents the loan amount requested by a customer
Current Loan Expenses (USD) If a customer has any current active loans, then this represents the amount that a spends on these loans (monthly)
Expense Type 1 Represents a type of expense that a customer spends on (monthly)
Expense Type 2 Represents a type of expense that a customer spends on (monthly)
Dependents Represents whether a customer has any dependencies (spouse, parents, siblings, children, etc.)
Credit Score Represents the credit score of a customer
No. of Defaults Represents the number of time a customer has defaulted
Has Active Credit Card Represents if a customer has any active credit cards or not
Property ID Represents an identification number of a property
Property Age Represents the age of a property
Property Type Represents the type of property
Property Location Represents the location of a property
Co-Applicant Represents whether a customer has co-applicants
Property Price Represents the selling price of a property
Loan Sanction Amount (USD) Represents the loan sanctioned amount for a customer

Importing required libraries

In [1]:
import pandas as pd
import plotly.express as px

Reading data

In [2]:
df = pd.read_csv('train.csv', na_values=[-999, '?', ' '])
print(df.shape)
df.head()
(30000, 24)
Out[2]:
Customer ID Name Gender Age Income (USD) Income Stability Profession Type of Employment Location Loan Amount Request (USD) ... Credit Score No. of Defaults Has Active Credit Card Property ID Property Age Property Type Property Location Co-Applicant Property Price Loan Sanction Amount (USD)
0 C-36995 Frederica Shealy F 56 1933.05 Low Working Sales staff Semi-Urban 72809.58 ... 809.44 0 NaN 746 1933.05 4 Rural 1.0 119933.46 54607.18
1 C-33999 America Calderone M 32 4952.91 Low Working NaN Semi-Urban 46837.47 ... 780.40 0 Unpossessed 608 4952.91 2 Rural 1.0 54791.00 37469.98
2 C-3770 Rosetta Verne F 65 988.19 High Pensioner NaN Semi-Urban 45593.04 ... 833.15 0 Unpossessed 546 988.19 2 Urban 0.0 72440.58 36474.43
3 C-26480 Zoe Chitty F 65 NaN High Pensioner NaN Rural 80057.92 ... 832.70 1 Unpossessed 890 NaN 2 Semi-Urban 1.0 121441.51 56040.54
4 C-23459 Afton Venema F 31 2614.77 Low Working High skill tech staff Semi-Urban 113858.89 ... 745.55 1 Active 715 2614.77 4 Semi-Urban 1.0 208567.91 74008.28

5 rows × 24 columns

Columns data types

In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 24 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Customer ID                  30000 non-null  object 
 1   Name                         30000 non-null  object 
 2   Gender                       29947 non-null  object 
 3   Age                          30000 non-null  int64  
 4   Income (USD)                 25424 non-null  float64
 5   Income Stability             28317 non-null  object 
 6   Profession                   30000 non-null  object 
 7   Type of Employment           22730 non-null  object 
 8   Location                     30000 non-null  object 
 9   Loan Amount Request (USD)    30000 non-null  float64
 10  Current Loan Expenses (USD)  29651 non-null  float64
 11  Expense Type 1               30000 non-null  object 
 12  Expense Type 2               30000 non-null  object 
 13  Dependents                   27507 non-null  float64
 14  Credit Score                 28297 non-null  float64
 15  No. of Defaults              30000 non-null  int64  
 16  Has Active Credit Card       28434 non-null  object 
 17  Property ID                  30000 non-null  int64  
 18  Property Age                 25150 non-null  float64
 19  Property Type                30000 non-null  int64  
 20  Property Location            29644 non-null  object 
 21  Co-Applicant                 29832 non-null  float64
 22  Property Price               29648 non-null  float64
 23  Loan Sanction Amount (USD)   29322 non-null  float64
dtypes: float64(9), int64(4), object(11)
memory usage: 5.5+ MB

Null values percentage in each columns

In [4]:
na_percentage = (df.isna().sum() / df.shape[0]) * 100
na_percentage
Out[4]:
Customer ID                     0.000000
Name                            0.000000
Gender                          0.176667
Age                             0.000000
Income (USD)                   15.253333
Income Stability                5.610000
Profession                      0.000000
Type of Employment             24.233333
Location                        0.000000
Loan Amount Request (USD)       0.000000
Current Loan Expenses (USD)     1.163333
Expense Type 1                  0.000000
Expense Type 2                  0.000000
Dependents                      8.310000
Credit Score                    5.676667
No. of Defaults                 0.000000
Has Active Credit Card          5.220000
Property ID                     0.000000
Property Age                   16.166667
Property Type                   0.000000
Property Location               1.186667
Co-Applicant                    0.560000
Property Price                  1.173333
Loan Sanction Amount (USD)      2.260000
dtype: float64
  • Large missing values in Income Stability, Type of Employment, Dependents, Property Age

Numeric features stats description

In [58]:
df.drop(['Customer ID', 'Name', 'Property ID'], axis=1, inplace=True)
In [59]:
df.describe()
Out[59]:
Age Income (USD) Loan Amount Request (USD) Current Loan Expenses (USD) Dependents Credit Score No. of Defaults Property Age Property Price Loan Sanction Amount (USD)
count 30000.000000 2.542400e+04 30000.000000 29651.000000 27507.000000 28297.000000 30000.000000 2.515000e+04 2.964800e+04 29322.000000
mean 40.092300 2.630574e+03 88826.333855 409.293722 2.253027 739.885381 0.193933 2.631119e+03 1.333359e+05 48210.120452
std 16.045129 1.126272e+04 59536.949605 217.738726 0.951162 72.163846 0.395384 1.132268e+04 9.297139e+04 48212.938805
min 18.000000 3.777000e+02 6048.240000 33.760000 1.000000 580.000000 0.000000 3.777000e+02 7.265950e+03 0.000000
25% 25.000000 1.650457e+03 41177.755000 250.745000 2.000000 681.880000 0.000000 1.650450e+03 6.192126e+04 0.000000
50% 40.000000 2.222435e+03 75128.075000 376.530000 2.000000 739.820000 0.000000 2.223250e+03 1.112765e+05 36064.225000
75% 55.000000 3.090593e+03 119964.605000 522.470000 3.000000 799.120000 0.000000 3.091407e+03 1.800713e+05 74779.277500
max 65.000000 1.777460e+06 621497.820000 3840.880000 14.000000 896.260000 1.000000 1.777460e+06 1.077967e+06 481907.320000
In [6]:
print('Unique Property Type', df['Property Type'].unique())
print('Unique Co-Applicant', df['Co-Applicant'].unique())
Unique Property Type [4 2 1 3]
Unique Co-Applicant [ 1.  0. nan]
  • Property Type can be converted to categorical feature as it has distinct 4 values.
  • Co-Applicant can be converted to categorical feature as it has binary value 1 or 0.
In [7]:
df['Property Type'] = df['Property Type'].astype('object')
df['Co-Applicant'] = df['Co-Applicant'].astype('object')

Categorical features stats description

In [8]:
df.describe(include='object')
Out[8]:
Customer ID Name Gender Income Stability Profession Type of Employment Location Expense Type 1 Expense Type 2 Has Active Credit Card Property Type Property Location Co-Applicant
count 30000 30000 29947 28317 30000 22730 30000 30000 30000 28434 30000 29644 29832.0
unique 30000 30000 2 2 8 18 3 2 2 3 4 3 2.0
top C-45505 Leeanna Munsterman M Low Working Laborers Semi-Urban N Y Active 1 Semi-Urban 1.0
freq 1 1 15053 25751 16926 5578 21563 19214 20180 9771 7863 10387 25516.0

Visualization

In [11]:
df['Income Stability'].fillna(df['Income Stability'].mode().values[0], inplace=True)
df['Gender'].fillna(df['Gender'].mode().values[0], inplace=True)
In [12]:
fig = px.histogram(df, x='Loan Sanction Amount (USD)', color='Gender', nbins=50)
fig.show()
In [26]:
fig = px.histogram(df, x='Gender', color='Income Stability', barmode='group')
fig.show()
In [14]:
fig = px.histogram(df, x='Gender', color='Profession', barmode = 'group')
fig.show()
In [16]:
fig = px.histogram(df, x='Type of Employment', color='Gender', barmode='group')
fig.show()
In [17]:
fig = px.histogram(df, x='Has Active Credit Card', color='Gender', barmode='group')
fig.show()
In [19]:
fig = px.histogram(df, x='Property Type', color='Gender', barmode='group')
fig.show()
In [20]:
fig = px.histogram(df, x='Property Location', color='Gender', barmode='group')
fig.show()
In [21]:
fig = px.histogram(df, x='Co-Applicant', color='Gender', barmode='group')
fig.show()
In [32]:
data = pd.DataFrame(df.groupby(['Co-Applicant', 'Gender'])['Loan Sanction Amount (USD)'].mean()).reset_index()
fig = px.bar(data, x='Co-Applicant', y='Loan Sanction Amount (USD)', color='Gender', barmode="group")
fig.show()

Visualization for 0 Loan Sanction Amount (USD)

In [35]:
no_sanction_df = df[df['Loan Sanction Amount (USD)'] == 0]
fig = px.histogram(no_sanction_df, x='Gender', color='Profession', barmode = 'group')
fig.show()
In [38]:
fig = px.histogram(no_sanction_df, x='Type of Employment', color='Gender', barmode = 'group')
fig.show()
In [39]:
fig = px.histogram(no_sanction_df, x='Co-Applicant', color='Gender', barmode = 'group')
fig.show()
In [48]:
fig = px.histogram(no_sanction_df, x='Property Age', barmode='group', nbins=10)
fig.show()
In [49]:
fig = px.histogram(no_sanction_df, x='Property Price', barmode='group', nbins=10)
fig.show()
In [50]:
fig = px.histogram(no_sanction_df, x='Dependents', barmode='group', nbins=10)
fig.show()
In [54]:
fig = px.histogram(no_sanction_df, x='Credit Score', barmode='group', nbins=50)
fig.show()
In [53]:
fig = px.histogram(no_sanction_df, x='No. of Defaults', barmode='group')
fig.show()
In [55]:
fig = px.histogram(no_sanction_df, x='Age', barmode='group')
fig.show()
In [56]:
fig = px.histogram(no_sanction_df, x='Current Loan Expenses (USD)', barmode='group')
fig.show()

Co-relation between numeric feature

In [76]:
data = df.corr()
fig = px.imshow(data, color_continuous_midpoint=0, color_continuous_scale='ice')
fig.show()
In [ ]: